# Full text of "Adaptive Monotone Shrinkage for Regression"

## See other formats

arXiv: 1505.01743vl [stat.ME] 7 May 2015 Adaptive Monotone Shrinkage for Regression Zhuang Ma Dean Foster Department of Statistics Yahoo Lab University of Pennsylvania New York, NY 10018 Philadelphia, PA 19104 dean@foster.net zhuangma@wharton.upenn.edu Abstract We develop an adaptive monotone shrinkage es¬ timator for regression models with the following characteristics: i) dense coefficients with small but important effects; ii) a priori ordering that in¬ dicates the probable predictive importance of the features. We capture both properties with an em¬ pirical Bayes estimator that shrinks coefficients monotonically with respect to their anticipated importance. This estimator can be rapidly com¬ puted using a version of Pool-Adjacent-Violators algorithm. We show that the proposed monotone shrinkage approach is competitive with the class of all Bayesian estimators that share the prior in¬ formation. We further observe that the estima¬ tor also minimizes Stein’s unbiased risk estimate. Along with our key result that the estimator mim¬ ics the oracle Bayes rule under an order assump¬ tion, we also prove that the estimator is robust. Even without the order assumption, our estima¬ tor mimics the best performance of a large family of estimators that includes the least squares es¬ timator, constant-A ridge estimator, James-Stein estimator, etc. All the theoretical results are non- asymptotic. Simulation results and data analysis from a model for text processing are provided to support the theory. 1 INTRODUCTION Feature selection and coefficient estimation are familiar topics in both statistics and machine learning communities. Many results in this area concern models that are "nearly black,’ possessing a handful of large effects against a wide field of noise. Consider the widely used linear model Y = X0 + e where e ~ N( 0, cr 2 I n ) , (1) X is full-rank, n x p matrix of explanatory features with p < n, and 0 is a p dimensional vector of unknown coeffi- Robert Stine Department of Statistics University of Pennsylvania Philadelphia, PA 19104 stine@wharton.upenn.edu dents. In the "nearly black’ case, all but a few of the coor¬ dinates of 0 are zero. A long sequence of results leverage this sparsity (Foster and George [1994]; Tibshirani [1996]; Abramovich et al. [2006]; Candes and Tao [2007]; Fan and Lv [2008]; Bickel et al. [2009]). Sparsity assumptions are well suited to many applications, especially within the field of signal and image processing (Donoho [1995], Wright et al. [2009]). Despite the prevalence of research on sparse models, some applications do not conform to this paradigm. For example, Foster et al. [2013] used methods such as latent semantic analysis, essentially principal components analysis (PCA), to convert text into features for regression analysis. The estimated coefficients of these principal components show two specific characteristics that draw our attention: dense coefficient estimates with a monotonically decaying effect size. Rather than concentrate in a few estimates, the pre¬ dictive power of the model spreads across many features. Sparsity-based methods such as hard or soft thresholding that set small effects to zero produce fitted models with greatly diminished predictive ability. Too much predic¬ tive signal has been lost by eliminating small, but nonethe¬ less informative, coefficients. Dense coefficients appear in other applications as well. Hall et al. [2009] and Dicker [2011, 2012] also propose models for dense signals and Dicker [2011, 2012] discusses several shrinkage estimators in high dimensions. The second characteristic of this application is the mono¬ tone decrease in typical effect size. The signal tends to concentrate in the leading principal components, then grad¬ ually decay. We may not know the signal strength, but we do have an ordering. In this sense, the unsupervised PCA of the text data provides useful information that can be ex¬ ploited within the regression. In particular, the eigenvalues from the PCA provide an external ordering of the features that is suggestive of the effect size. Such exogenous infor¬ mation appears in other domains. In time series analysis, data collected more recently are expected to be more in¬ formative for the prediction of future trends. In principal components regression, we tend to expect the first principal component to be more important than the second. There¬ fore, models without incorporating this prior knowledge might be suboptimal. In this paper, we capture both characteristics with an em¬ pirical Bayes estimator that shrinks coefficients monotoni- cally with respect to their anticipated importance. The pro¬ cedure is tuning free and can be efficiently implemented using Pool-Adjacent-Violators algorithm. We further show that the estimator can be derived from frequentists’ per¬ spective as well by minimizing Stein’s unbiased risk es¬ timate. Finally, we establish non-asymptotic results to gau- rantee that the proposed estimator is nearly Bayes optimal under the order assumption and even when the order as¬ sumption (or say, prior knowledge) is wrong, it still mim¬ ics the best performance of a large family of estimators that includes the least squares estimator, ridge estimator, James- Stein estimator, etc. I p . Then the Bayes rule p* in (1) and (2) is: where p = (pi, ■ ■ ■ , P p ) = X T Y is the least squares esti¬ mator. The Bayes rule shrinks pi monotonically, shrinking more and more harshly as the index and of increase. For our application, we know only the order of the features, not the signal strength of, so the Bayes rule is not a real es¬ timator because it depends on the unknown parameter of. To mimic the performance of the Bayes rule, we estimate the of s from data under the order constraint and use the resulting plug-in estimator. For convenience, we write the model as /? ~ N( 0, E) where E = diag(a\ , • • • , of). 2.2 Maximum Marginal Likelihood Estimator The rest of this paper is organized as follows. In sec¬ tion 2, we describe the monotone shrinkage model in de¬ tail and introduce the maximum marginal likelihood esti- mator(MMLE) and Pool-Adjacent-Violators algorithm. In section 3, we show that the proposed estimator also mini¬ mizes Stein’s unbiased risk estimate and establish its non- asymptotic oracle properties both with and without order assumption. In section 4, we suggest an estimator of the er¬ ror variance a 2 . In section 5, we present simulation results and an analysis of text to support our theory. Concluding remarks are given in section 6. Details of the technique are provided in the Appendix. 2 ADAPTIVE MONOTONE SHRINAKGE 2.1 Model Formulation We use a Bayesian framework to encode the prior knowl¬ edge about the importance of explanatory features in our model. We express this prior knowledge in a distribution of the coefficients (3. Intuitively, if the features within the regression are standardized such that X T X = I p , then the coefficient |/3j| gives the importance of the ith feature: a unit change in X t is associated with a change of |/3j| in the response. A natural prior that captures the sense that the elements of p have decaying size specifies a monotone decreasing sequence of variances for the coefficients. For convenience, we assume that the size of signal in pi is de¬ creasing with the index i: Pi ~ N(0, af) of > of > • • • > a 2 p > 0 Since we know the order, we can always rearrange Pi so that the unknown of are monotone as above. Throughout, we consider only orthonormal designs for which X T X = To estimate the ordered prior variances on the diagonal of E, we observe that the marginal distribution of Y is N( 0, XYjX t + cr 2 I n ). If we further assume the error vari¬ ance a 2 from (1) is known (or we can plug in a consistent estimator), a simple calculation shows that the least square estimator p = X T Y is a sufficient statistic for E. So, in the following discussions, we base our inference on p, whose marginal distribution is N(0,cr 2 I p + E). A natural esti¬ mator of E is the maximum marginal likelihood estimator (MMLE). The log marginal likelihood function is ((E) = ^log(27r) + logO 2 + of) + j Consequently, the MMLE is the solution to the following optimization problem: arg min yf log(o- 2 + of) + *=i V (T 2 + a 2 (3) subject to v 2 i>v%> ■■■><?;> o 2.3 Pool-Adjacent-Violators Algorithm The optimization problem (3) resembles the well-known isotonic regression problem n p lso = arg min yf (y t — Pi) 2 subject to P i > • • • > p n ■ i= 1 (4) whose unique solution can be efficiently obtained by running the Pool-Adjacent-Violators (PAV) algorithm. Roughly speaking, this algorithm solves (4) as follows. Set i = 1. Move to the right (increase the index i) until finding a pair (?/,:, y t +i) that violates the monotonicity constraint, that is yi < Vi+i- Pool y, : and the adjacent Di+i and replace both by their average. Next check whether r/j_i < ?;,+ f ,+1 . If so, replace (j/*-i. y.j, y i+ 1 ) with their average. Continue to the left until monotonicity is satisfied and then proceed to the right until the whole sequence is monotone. Hence, PAV algorithm outputs an decreasing blockwise constant sequence. As far as we know, the PAV algorithm dates back to Ayer et al. [1955], where it is used to compute the MLE of independent binomial distributions. Brunk [1955, 1958] considered rather general scenarios and established some consistency properties. According to Grotzinger and Witz- gall [1984], if carefully implemented, the PAV algorithm has computational complexity 0 (n). Although the optimization problem (3) is not convex, it can be solved efficiently by the PAV algorithm. Before estab¬ lishing this result, we introduce some notations. fi(x) = log (a; + cr 2 ) + ^ X + <7 Z of = arg min/j(a;) = P? - o 2 X Proposition 2.1. The following two-step algorithm pro¬ duces the MMLE denoted by (a 2 , • • • , a 2 ) Step!, (of, ■ ■ ■ , a 2 ) = PAV(of,■■■ ,<r 2 ) Step 2. a 2 = dfl {6 |> 0) . For those of estimated to be 0, it means the corresponding features are not included in the model. To prove the proposition, we first introduce a lemma: Lemma 2.1. Consider optimization problem p min ^/i(0i) subject to: 9\ > 62 > • • • > 9 V where the elementwise solution 61 = arg mine fi(9) is fi¬ nite. If {fi, ■ ■ ■ , f p } satisfies the pooling condtion defined below, then the optimization problem has a unique solution (0i,- J P )=PAV(e ir -- ,e p >. Definition 1 (Pooling Condition). For a sequence of func¬ tions {fi,--- ,/p} with 9i = arg min 0 fi(9) being finite. Let 6 i:j = J2l=i &k/(j ~i + !)• We say (A, - - - , f p } sat¬ isfies pooling condition if Mi < j, argmin fk(fi) = 9ij and fk{@) is strictly decreasing when 9 < 9ij and strictly increasing when 9 > 9ij. (of,-- - ,of) = PAV ( of, of) solves: min ^(log(a 2 + of) + a 2^ a2 ) subject to: of > of > ■■■ > of which is only slightly different from our original optimiza¬ tion problem. To finish the proof, we just need to introduce some auxiliary functions. Let /_*, = log (o 2 + <x/ k ) + ,1 < k < n, so d 2 _ k = argmin f- k (o 2 _ k ) = 0. ' — k Consider the following optimization problem (Q n ): Lemma 1 shows PAV(<x 2 , • • • , of, 0, • • • ,0) solves ( Q n ). Denote it (of^ • • • , o 2 npl o 2 _ nn , ■■■ , o 2 nl ). Notice that (of, ■ ■ ■ , of, 0, • • • , 0) is a feasible solution, which should be suboptimal, that is to say, ^(log(cr 2 + &f 2i ) H- 2 V -2 ) + E < £(]og(o* + + ff 2^ 2 ) + E Recall that df_ k = argmin f-k(cr 2 _ k ) = 0, which implies E(!0g(0- J + O-fi) + 2^2 ) i= 1 m < J2(\og(o 2 + of) + J' 2 ) i= 1 ' l Let n goes to infinity, PAV(ct 2 ,--- ,df, 0, • • • ,0) con¬ verges to (dfl {d . 2 > 0) , ■ • • , ct 2 / ( ^ 2 > 0) , 0, • • • , 0) and the in¬ equality above implies (offi^y^,--- ,ofl^ 2 >o)) is the solution to the original optimization problem. □ The pooling condition can be easily checked and numerous distributions have log likelihood functions that satisfy the condition. These include the binomial distribution, Pois¬ son distribution, normal distribution with fixed variance and variable mean, normal distribution with fixed mean and variable variance, and so on. Proof of Proposition 2.1 The situation we are faced up with is normal distribu¬ tion with fixed mean and variable variance, which satis¬ fies the conditions in Lemma 1. According to the lemma. 2.4 Data-Driven Blockwise James-Stein Estimator: Global and Local Adaptivity Proposition 1 and the nature of Pool-Adjacent-Violators algorithm show that the MMLE (if 2 , • • • , <f 2 ) is decreas¬ ing and blockwise constant. We now change notation in this part and let of denote the common variance estimate of the zth block. Define /% = (fin , Pi ni ) to be the coefficients within the fth block and correspondingly de¬ fine @i, j!li. With these notations, we can write explicitly Y^LAPij-° and Pi of + o 2 (i- S"ii P?i ) + Pi i which is exactly the positive part of the James-Stein type estimator. Hence the proposed estimator can be interpreted as a monotone blockwise James-Stein estimator. Block- wise James-Stein estimator is well-studied in the wavelet setting (Cai [1999], Cai and Zhou [2009]). In Cai [1999], the block size is fixed before observing the data and is the same for all blocks. Cai and Zhou [2009] proposed an adaptive procedure to make the block size data-driven but the block size remains the same for all blocks. As for our procedure, the number of blocks and the size of each block are completely data-driven. The difference is due to dif¬ ferent assumptions. The former is based on smoothness of Besov bodies while the later is based on monotonicity. The advantage of our data-driven, monotone blockwise James-Stein estimator is the ability to achieve both global and local adaptivity. Blockwise shrinkage utilizes informa¬ tion about neighboring coefficients. However, if the block size is too large, local inhomogeneity might be overlooked. So, the best way to achieve a good balance is to let the data speak for itself. 3 ORACLE RISK PROPERTIES 3.1 Equivalence between MMLE and SURE Estimator In this section, we show that our empirical Bayes estima¬ tor can also be derived within a frequentist framework by minimizing Stein’s unbiased risk estimate (SURE). Under squared error loss: 10, p) = \Y7i=i0i ~ Pi) 2 ■ If one uses the shrinkage estimator /3 A defined by p x = x .+ a2 Pi to estimate fa, the risk for a given /3 is: R P 0\p) = E[10\P) I A?) P ' A i) and an unbiased estimate for the risk is SURE{ A) 1 p 2 32 i—1 o 2 + A , VP: a 2 (Xi - cr 2 ) a 2 + A i the monotone shrinkage parameter A = (Ai, • • • , A p ). If only considering the family of monotone shrinkage esti¬ mators, the relationship suggests that the data-dependent A which minimizes SURE(A) should be a good choice. De¬ fine: Xsure = arg min Ai>--->A p >0 2 Pi- a 2 (X j-a 2 ) G 2 + A; which is of the same form as optimization problem (3). Let <7i(A») = {0^x~) 2 P 2 + g 1 • Then it is easy to see that A i = argmingj(Aj) = 0 2 — a 2 . Checking that gi(X t ) satisfy the two conditions in Lemma 2.1, the same argu¬ ment used to show Proposition 2.1 implies: Proposition 3.1. MMLE equals SURE estimator P^sure. In the rest of the paper, we will use Psure = p XsuRE to denote the proposed estimator. Remark 3.1. Monotone shrinkage estimator was also in¬ vestigated in Xie et al. [2012] when dealing with het- eroscedastic normal sequence model. Different empirical Bayes estimators were studied in this paper and SURE es¬ timator was shown to dominate MMLE and method of mo¬ ments. While in our context, the three estimators turned out to be the same. 3.2 Oracle Property with Order Assumption Proposition 3.1 provides us with a powerful tool to inves¬ tigate the risk properties of the proposed estimate. First of all, we introduce the oracle estimator, namely the Bayes rule (3* = {Pi,-- - ,Pp) defined by p: = of + o ■ ;Pi Of course, /?* is not an practical estimator because it de¬ pends on the unknown parameter A* = (of, ■ ■ ■ ,cr 2 ). It is easy to see the oracle risk is: i P 2 2 p (T 2 + of 1=1 1 Then we introduce another lemma, which is the building block of the oracle properties. It says that E[SURE{\)] is uniformly good approximation of the true risk E[l0fa)], where the expectation is with respect to both data and pa¬ rameter fa Lemma 3.1. Generally, SURE(A) is unbiased estimate of the risk only if A is a fixed constant and cannot depend on data. We say p x is a monotone shrinkage estimator i f ft = j^ffPi and Ai > • • • > A p > 0, where A , can be data dependent. A monotone shrinkage estimator is completely determined by sup sup \E{E[10 X ,P)-SURE{X)\P]}\ < L, 2 o 2 Xl>-->X P » P where A = (Ai, • • • , \ v ) is arbitrary monotone shrinkage parameter and can be data dependent. Theorem 3.1. Take expectations, we have sup (. R0sure ) - R(P*)) < 4< f-o 2 Vl>-><7%>0 V P Proof: Because A* is fixed constant, we have R0SURE , P) - R(P*, p) = E[10sure,P)\P) - E[SURE(X*)\P\ = E[10sure,P) ~ SURE(X S ure)+ SURECXsure) ~ SURE(X*)\P] < E[1(Psure,P ) - SURE(Xsure)\P] The inequality is due to the definition of A sure- So the Bayes risk satisfies: R{Psure ) - R{P*) < E{E[1(Psure, P) ~ SURE(Xsure)\P ]} < sup sup \E{E[l(p % ,p)-SURE(X)\p}}\ of- Ai>-->Ap>0 R{Psure ) - R(fa) < E{E[l(P, Psure)- SURE(Xsure) + SURE(fj) - l(P,fa)\P}} <2 sup sup \E{E[l(P~ x ,P)-SURE{X)\p]}\ of — ,ol Ai>--->Ap>0 Lemma 3.1 implies, R{Psure ) — inf R(Py) < 8i j-o 2 + e 7l>->7p>0 V P Since the upper bound does not depend on erf, let e —* 0, and the theorem follows. □ If we replace the data dependent shrinkage parameters in Theorem 2 with fixed ones, we can improve the error bound by a factor of 2, which is Corollary 3.2. sup (R(Psure) ~ inf -R(/3 7 )) < 4-i/—cr 2 ct 2 ... j0 .2 7l>-">7p>0 V P Applying lemma 3.1 finishes the proof. □ Remark 3.2. Theorem 3.1 shows that SURE estimator mimics the oracle Bayes rule and therefore outperforms all other estimators. What needs to be highlighted is that this is a non-asymptotic result with rate of convergence 0 (p~ 2 ) independent of the true erfs. No matter how the erfs vary, as long as the order is known, the proposed adaptive pro¬ cedure can uniformly capture the truth. Corollary 3.1. sup = 1 + 4,/| of>-><?l> 0 V 3.3 Oracle Property without Order Assumption In this section, we show that even without knowing the or¬ der of the erf’s, the proposed estimator retains an oracle property among monotone shrinkage estimators. Theorem 3.2. sup ( R0SURE ) - inf R(P^)) < 8 \ -a 2 0-2,... i(T 2 7l>'">7p>0 VP Proof: For any given (of, ■ ■ ■ ,of), we can always find f) = (f/i, • • • ,r) p ) that satisfies fji > • • • > fj p > 0 and R(Pfj) < inf R(Py) + £• Then, 7!>->7p>0 R(Psure)— inf R(Py) < R(Psure) — R{Ptj)+e 7l >...>7 P >o Notice that, l(P, Psure)~ 1(P, Pfj) = (1 (P,Psure)~SURE(Xsure + (SURE(XsuRE)-SURE(p))+(SURE(fi)-l(p,pr ) )) < ( l(P , PsuRE)-SURECXsuR E ))+(SURE(fj)-l(P, fa) Theorem 2 shows that even when the order assumption is invalid, the proposed estimator is nearly the best in the family of monotone shrinkage estimators. In particular, uniform shrinkage estimators such as least square estima¬ tor, ridge estimator and James-Stein estimator and step¬ wise regression methods such as monotone AIC, BIC, RIC (just search for p nested submodels: with ith submodel as {1, • • • , *}) are included. This is also a non-asymptotic re¬ sult with rate of convergence 0 (p~ 2 ) independent of the erfs. Therefore, the proposed estimator is robust and good enough for practical use. Actually, Theorem 3.2 states about the worst case. If the order is partially right, the proposed procedure benefits where the order is right and retains good properties where the order is wrong. Remark 3.3. The robustness is due to the ‘soft constraint ’. Instead of restricting the norm of regression coefficients to be monotone, we incorporate the constraint in the prior distribution, which makes the model flexible and robust. 4 ESTIMATION OF a 2 We have assumed o 2 is known to establish the theoretical properties of our estimator. Here we suggest a reasonable estimate in practice that is based on maximum marginal likelihood. Unlike section 2.2, within this section the un¬ known parameter becomes 6 = (of, ■ ■ ■ ,of,o 2 ). Recall that the marginal distribution of Y is N( 0, X'EX 1 '+o 2 I n ). So, the log marginal likelihood function: l(0\y) oc -log(\XXX T +o 2 I n \)-y T (XXX T +o 2 I n )- 1 y where j • | means determinant. Let X = (xi,--- ,x p ), we can add another n — p vectors x p +i, • • • , x n to make X = (A", ajp+i, ■ ■ ■ , x n ) an orthonormal matrix. Let E = diag{Y, + a 2 I p ,cr 2 I n - p ). Then AEX T + a 2 I n = XtX T and thus (AEA T + cr 2 In) -1 = AE _1 X T . Plug this ex¬ pression back into the marginal likelihood function, l{e\y) (X -log(\XtX T \) - y T Xt~ 1 X T y We abuse notation in this section and let f} = X T y. If we introduce variable (rf,, r 2 ) = diag(T,) = (cr 2 + a 2 , - ■ ■ , cr 2 + cr 2 , (7 2 , • • • , a 2 ), then K s ) « -^(iogr 2 + »=i * So, the MMLE is the solution to the following optimization problem. min^^ogr 2 + i= 1 ^ subject to: if > • ■ ■ > r 2 > r 2 +1 = ■ • • = r 2 > 0 Following a similar but slightly different argument in Proposition 2.1, we obtain the following result and omit the proof. Proposition 4.1. I'he solution is uniquely given by (a j 11 1 ) E? 1 A, PAV0 2 ,- B 2 5 rp 5 n — p n — p 77ze MMLE of original parameters can be recovered by CaV" ,r 2 f 2 +1 ,--- ,f 2 ) = (<r? + d 2 ,-- * 2 ,*V-- ,<r*). . + 5 NUMERICAL EXPERIMENTS 5.1 Simulation Results In this section, we compare the proposed monotone shrink¬ age approach with several other popular methods for fea¬ ture selection and estimation. For simplicity, we only con¬ sider the normal sequence model and assume the error vari¬ ance cr 2 is known. • PAV, the proposed adaptive monotone shrinkage pro¬ cedure computed by Pool-Adjacent-Violators algo¬ rithm. • Lasso, with A selected by minimizing Stein’s unbi¬ ased risk estimate. Under orthogonal design, it is also known as Sureshrink (Donoho and Johnstone [1995]). • Ridge estimator with A selected by Cross-Validation. • Positive part of James-Stein estimator • Classical stepwise regression, we use AIC for penalty criterion. • Monotone AIC: AIC that just searches for p nested submodels, i.e., with kth submodel={l, • • • , fc} We consider the following scenarios (p = 100, cr 2 = 1): 1. Signals with Decaying Size: (cr 2 , • • • , cr 2 ) are gener¬ ated from decreasing order statistics of 2x 2 . 2. Signals with Same Size: a 2 = 2, VI < i < p 3. Sparse Signals: first 90% of the cr? are 0 and remain¬ ing 10% of cr 2 are generated from decreasing order statistics of 4\ 2 . 4. Signals with Increasing Size: (cr 2 , • • • , a 2 ) are gener¬ ated from increasing order statistics of 2\- 2 . This sce¬ nario dose not satisfy our order assumption(actually, the worst case), which is used to show the robustness of our procedure. With (cr 2 , • • • , cr 2 ) fixed, we adopt the following simula¬ tion strategy. L Generate /? == (/3 X , • • • , /3 P ) by Bi ~ N( 0, cr 2 ) 2. Condition on /3, generate the observation X = (xi , * * * ,x p )by Xi ~ N{Bi,a 2 ) 3. Use the methods discussed above to estimate the sig¬ nal B and compute the mean square error. 4. Repeat 1-3 for 400 times. The average of the mean square errors is an estimate of the Bayes risk. Mean square error for different /? are given below using box plot and the middle line of each box represents Bayes risk of each procedure. The red line in the figure stands for the oracle risk, i.e., the Bayes risk of the oracle Bayes rule. PAV James-Stein Ridge Lasso Montone AIC AIC Figure 1: Signals with decaying size, i.e. {a 2 } is decreasing. The adaptive monotone shrinkage procedure pools signals of similar sizes together and shrinks blockwisely and monotonically. For signals with decaying size, oracle estimator shrink monotonically with respect to the size of the signals. Uni¬ form shrinkage estimators such as ridge and James-Stein estimator are suboptimal. The proposed adaptive monotone procedure makes use of the prior information and mimics the oracle Bayes mle by pooling signals of similar size to¬ gether so that it shrinks blockwisely and monotonically. AIC overfits the data while monotone AIC makes use of the order structure and therefore performs better. PAV James-Stein Ridge Lasso Montone AIC AIC Figure 2: Signals with same size, i.e. of s are the same. This is the case where ridge and James-Stein estimator capture the truth with full power while our procedure will regard the of as differ- ent(decreasing) and will generally divide the of s into more than one blocks, which leads to slight power loss. For signals with same size, the oracle estimator shrink uni¬ formly. Ridge and James-Stein estimator mimic the oracle Bayes rule with full power. The proposed adaptive proce¬ dure does not necessarily gaurantee uniform shrinkage but the power loss is negligible. PAV James-Stein Ridge Lasso Montone AIC AIC Figure 3: Sparse Signals, i.e. the size of the signals remain decreasing while 90% of them are 0. The proposed monotone shrinkage procedure can effectively kill the noise and shrink the signals properly. For sparse signals, the oracle estimator kill the noise and shrink the signals monotonically. Monotone AIC can effi¬ ciently distinguish signal and noise while does not shrink the signals. The proposed adaptive procedure not only kills the noise but also shrinks the real signals properly accord¬ ing to their sizes. For those methods that cannot make use of the order structure. Lasso does better in this sparse case. -n-1-n-1-1-1— PAV James-Stein Ridge Lasso Montone AIC AIC Figure 4: Signals with increasing size, i.e. {erf} is increasing, which is opposite to our assumption that the size of signals is decaying. The adaptive monotone shrinkage procedure is robust and performs as good as other estimators. For signals with increasing size, the proposed estimator uses completely reverse order. As theorem 2 expects, wrong prior knowledge won’t min our estimator. It still mimics the best performance of the monotone shrinkage family. However, monotone AIC, which is not as robust as our procedure, suffers a lot from wrong prior knowledge. 5.2 Analysis of Text Processing Data In this section, we apply the proposed adaptive monotone shrinkage approach to text data of real estate described in Foster et al. [2013]. The features included in the regres¬ sion model are the leading 1500 principal components of the bag-of-words of text. The response is the log transfor¬ mation of the real estate price. We use the eigenvalues from PCA to order the effect size of the features (see Figure 5 for the absolute t-statistics of the leading 500 principal compo¬ nents). Although the data dose not ideally satisfy the as¬ sumptions of our model, the proposed adaptive procedure is robust enough to leverage this rough prior knowledge. Absolute t-statistics of the leading 500 Principal Components Figure 5: Absolute t-statistics of the leading 500 Principal Com¬ ponents. Those above the red line are significant. The sample size is 7384 and we use 10 fold cross validation to estimate the prediction error of each procedure. Prediction Error by 10 fold Cross Validation Figure 6: Prediction error comparison of different methods. Las- soSURE: Lasso with tunning paramter selected by minimizing Stein’s unbiased risk estimate. LassoCV: Lasso computed by LARS (Efron et al. [2004]) and paramter tuned by cross valida¬ tion. The result shows that • PAV outperforms Ridge regression. From Figure 5, we can see that the signals are of different sizes. Uni¬ form shrinkage method shrink the important features too much while shrink weak signals less harshly than it should be. PAV can adaptively pool the signals of similar size together and shrink blockwisely and monotonically. • PAV outperforms LassoSure and LassoCV. Lasso can capture the sparse pattern of the data but as sacrifice, it might shrink important features a bit more than they should be. • PAV outperforms Monotone AIC. Both procedures make use of prior information but PAV is more robust. There are several informative principal components corresponding to small eigenvalues so that Monotone AIC will exclude them from the model. 6 CONCLUSIONS In this paper, we proposed an adaptive monotone shrink¬ age approach for regression with features of ordered effect size. We showed that the procedure can be rapidly com¬ puted via Pool-Adjacent-Violators algorithm and holds ora¬ cle risk properties. Non-asymptotic results are established. Furthermore, although the procedure is based on knowing the right prior knowledge about the features, we proved that, when the prior knowledge is wrong or in the absence of prior knowledge, the estimator still mimics the best per¬ formance of the family of monotone shrinkage estimators. Hence, it is robust enough to use in practice. Compared with penalized least square methods which re¬ quire heavy computational effort to find the best regular¬ ization paramter, the proposed adaptive procedure is tuning free. As noticed in the analysis of text data, the monotone shrinkage approach naturally works with PCA since the principal components are essentially ordered and orthogo¬ nal. Recent developments in randomized algorithms(Halko et al. [2011]) enable us to quickly compute the PCA of a huge matrix so that the proposed procedure can be easily applied to large-scale datasets. 7 APPENDIX 7.1 Proof of Lemma 2.1 It is sufficient to prove the following two claims: i) (01, • • • , ^)=PAV(0i, • • • , 0 k ), l^k<p ii) VI <i,j < k, Of = Of => 9 f = 0f,Mk < m < p We prove the claim by induction: 1. It is trivial for k = 1 since 0\ = 0\ 2. Assuming the claim holds for k and we are going to prove it is valid for k + 1 as well. Case 1 9 k+ 1 < Of. It implies that E/*(<??)+/*+ 1 &+ 1 ) i=1 = n rnir U fi( e i ) + min/fc+i(0fe+i) >k —" Vk + 1 Z = 1 k +1 < min fi(0i 0i>->0fc+i-4v V 1=1 and therefore. //i/c+1 r\k+ 1\ _ (nk r\k 7\ \ lyi — (yi>*" 5 = PAv(o ir -- ,e k ) So we prove claim i). Notice that 0fX\ f 0f +1 , Vj < k, claim ii) is true by induction. Case 2 0 k +\ > Of. Denote j the smallest integer such that Of = 0f +1 = • • • = Of. Because the boundary is not active between 0j -1 and 0j, by the pooling condition, we can conclude that (0f,0f +1 ,--- ,0f) = arg min Zf=j M°i ) and U j V k of = ■■■ = Of = Zlj Oi/(k - j + 1) < 0 k+ 1 . We claim: 0™ = 6j+ 1 = • • • = 9™ + i,Vk < m < p. By induc¬ tion we have already known that O'J 1 = 0f x +l = • • • = 0™. If < 0 k +i, then by pooling condition, f k+1 (0 k+ i) is strictly decreasing on (0,0]?), which forces 0™ +1 = Of 1 . If Of 1 > 4+i, then 0Jf +1 > 0 k+1 since f k +i(0) is uni- modal and achieves minimum at 0 k +i- Because 0 k+ 1 > Zi=j 0i/{k — j + 1), again the pooling condition forces Of = 0j\ 1 = --- = Of = Ztj 0i/(k-j + 1) and hence Of = 0f + i = ■ ■ ■ = Of = 0f + 1 - Therefore we proved Of =0f +1 = ■■■ = 0f +1 ,Vm > k. Specifically, §f +1 = oftl = ■■■ = ofll If Z k i=f 0i/(k - j) < 0f_ } , still by pooling condition, 0f +l = • • • = 0f+\ = Zf=j 0i/(k-j) and consequently 0f +1 — Of, 1 < i < j — 1. We are done because the solution is exactly PAV(#i, • • • , 0 k +i), which proves claim i) and Of = 0f +1 = ■ ■ ■ = 9f +l .\/m > k implies claim ii). If ff - > Of-i, assume i to be the smallest integer such that Of = 0f +1 = ■ ■ ■ = 0f_ v By similar argument, we can prove that Of = 0f_ 1 = ■ • ■ = 0fc!|_i,Vm > k. If ZtZi 0t/{k — i) < 0f_ i, we are done. If not, continue the same argument. □ 7.2 Proof of Lemma 3.1 Plug in the expression of SURE(A), we have E[1(P X ,P)-SURE(\)\0] 2A; A (ft 2 - Pifc - a 2 ) -0f-a 2 -p 2 m = e[- y P i=1 <T + A i Take expectation with respect to /?, we get E{E[l(p x ,p)-SURE(Xm} = E{E[~ Y ^V(& 2 - AA - Olft} P + A* Notice that ft|ft ~ ) and the marginal distribution of ff is N(0, a 2 + of), we change the order of expectation and get: Let Mj = Zi =1 ( z i ~ 1)’ then Mj is a martingale. So the L 2 maximal inequality implies: E( max Mf) < 4 E(M 2 ) = 8 p i <j<p \E{E[l(P^,P) — SURE(\)\P}}\ <2a 2 f?{maX p |i^(Z l -l)|} ^ 7 (E( i<S Mi)2) ' Combine the two inequalities, we have I E{E[l(P x ,P) - SURE(X)\p]}\ < 4^a 2 Since the error bound does not depend on A and of, the lemma follows. □ E{E[l{P k ,P)-SURE( A)|ft} = Pi-A\ ‘P^cr 2 + \i'<? 2 + <?t where the expectation is with respect to ft ~ N(0, a 2 + \E{E[l(P x ,P)-SURE(X)\P]}\ < P^[a 2 + \i <2 o 2 E{ sup *|Y>( ^ 2 - 1)|} l>ci>->c p >0 P ° 2 + °i i p = 2 cr 2 E{ sup -| Y c d z i ~ 1)1} l>ci>-”>c p >0 P where Zi ~ i.i.dx 2 ■ Observe that sup i>(z,. 01 l>ci>-">c p >0 P i=1 1 j = max I - V (Zi — 1)1 1=1 which is also used in Lemma 7.2 of Li [1985] and Theorem 3.1 in Xie et al. [2012], we have: \E{E[1(P‘ X ,P)-SURE(X)\P]}\ < 2o 2 E{ max |-^(Z i -l)|) 1 <]<P P 1=1 References Felix Abramovich, Yoav Benjamini, David L Donoho, and Iain M Johnstone. Adapting to unknown sparsity by con¬ trolling the false discovery rate. The Annals of Statistics, 34(2):584-653, 2006. Miriam Ayer, H Daniel Brunk, George M Ewing, WT Reid, and Edward Silverman. An empirical distribution func¬ tion for sampling with incomplete information. The An¬ nals of Mathematical Statistics, pages 641-647, 1955. Peter J Bickel, Ya’acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705-1732, 2009. H. D. Brunk. Maximum likelihood estimates of monotone parameters. The Annals of Mathematical Statistics, 26 (4):607-616, 1955. H. D. Brunk. On the estimation of parameters restricted by inequalities. The Annals of Mathematical Statistics, pages 437-454, 1958. T Tony Cai. Adaptive wavelet estimation: a block thresh¬ olding and oracle inequality approach. Annals of statis¬ tics, pages 898-924, 1999. T Tony Cai and Harrison H Zhou. A data-driven block thresholding approach to wavelet estimation. The Annals of Statistics, pages 569-595, 2009. Emmanuel Candes and Terence Tao. The dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, pages 2313-2351, 2007. Lee Dicker. Dense signals, linear estimators, and out-of- sample prediction for high-dimensional linear models. arXiv preprint arXiv: 1102.2952, 2011. Lee Dicker. Optimal estimation and prediction for dense signals in high-dimensional linear models. arXiv preprint arXiv:1203.4572, 2012. David L Donoho. De-noising by soft-thresholding. Infor¬ mation Theory, IEEE Transactions on, 41(3):613—627, 1995. David L Donoho and Iain M Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the american statistical association, 90(432): 1200- 1224, 1995. Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tib- shirani, et al. Least angle regression. The Annals of statistics, 32(2):407^199, 2004. Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodol¬ ogy), 70(5):849-911,2008. Dean P Foster and Edward I George. The risk inflation cri¬ terion for multiple regression. The Annals of Statistics, pages 1947-1975, 1994. Dean P Foster, Mark Liberman, and Robert A Stine. Fea- turizing text: Converting text into predictors for regres¬ sion analysis. 2013. SJ Grotzinger and C Witzgall. Projections onto order sim- plexes. Applied mathematics and optimization, 12(1): 247-270, 1984. Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding stmcture with randomness: Probabilistic algo¬ rithms for constructing approximate matrix decomposi¬ tions. SIAM review, 53(2):217-288, 2011. Peter Hall, Jiashun Jin, and Hugh Miller. Feature selection when there are many influential features. arXiv preprint arXiv:0911.4076, 2009. Ker-Chau Li. From stein’s unbiased risk estimates to the method of generalized cross validation. The Annals of Statistics, pages 1352-1377, 1985. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267-288, 1996. John Wright, Allen Y Yang, Arvind Ganesh, Shankar S Sastry, and Yi Ma. Robust face recognition via sparse representation. Pattern Analysis and Machine Intelli¬ gence, IEEE Transactions on, 31(2):210—227, 2009. Xianchao Xie, SC Kou, and Lawrence D Brown. Sure esti¬ mates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500): 1465- 1479, 2012.